Abstract

Author: Charles Tapley Hoyt

This notebook outlines the systematic assessment of errors in a BEL document. The data used are from the Alzheimer's Disease (AD) knowledge assembly, which has been annotated with the NeuroMMSig Database. Error analysis is not meant to place blame on contributors to a BEL document, but rather to inform curation leaders where recuration efforts should be focused and to make analysts aware of the issues with a given BEL document.

Notebook Setup


In [1]:
import logging
import os
import re
import time
from collections import Counter, defaultdict
from operator import itemgetter

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from fuzzywuzzy import process, fuzz
from matplotlib_venn import venn2

import pybel
from pybel.constants import PYBEL_DATA_DIR
from pybel.manager.cache import CacheManager
from pybel.parser import MetadataParser

import pybel_tools as pbt
from pybel_tools.utils import barh, barv

In [2]:
%config InlineBackend.figure_format = 'svg'
%matplotlib inline

In [3]:
logging.getLogger('pybel.cache').setLevel(logging.CRITICAL)

Notebook Provenance

The time of execution and the versions of the software packages used are displayed explicitly.


In [4]:
time.asctime()


Out[4]:
'Tue Apr 25 11:07:43 2017'

In [5]:
pybel.__version__


Out[5]:
'0.5.4-dev'

In [6]:
pbt.__version__


Out[6]:
'0.1.8-dev'

Local Path Definitions

To make this notebook interoperable across many machines, the locations of the repositories that contain the data used in this notebook are read from environment variables, which are set in ~/.bashrc to point to where the repositories have been cloned. Assuming the repositories have been git clone'd into the ~/dev folder, the entries in ~/.bashrc should look like:

...
export BMS_BASE=~/dev/bms
export BANANA_BASE=~/dev/banana
export PYBEL_RESOURCES_BASE=~/dev/pybel-resources
...
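
Before running the rest of the notebook, it can be useful to check that the relevant variables are actually set. This is a minimal sketch (not part of the original notebook) covering only the two variables used below:

>>> import os
>>> for variable in ('BMS_BASE', 'PYBEL_RESOURCES_BASE'):
...     assert variable in os.environ, '{} is not set in the environment'.format(variable)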

BMS

The Biological Model Store (BMS) is the internal Fraunhofer SCAI repository for keeping BEL models under version control. It can be downloaded from https://tor-2.scai.fraunhofer.de/gf/project/bms/


In [7]:
bms_base = os.environ['BMS_BASE']

PyBEL Resources

PyBEL Resources is a set of namespaces and annotations that have been made available by the AETIONOMY project through collaboration between the PyBEL Core team and the NeuroMMSig Database developers. It can be downloaded from https://github.com/pybel/pybel-resources


In [8]:
pybel_resources_base = os.environ['PYBEL_RESOURCES_BASE']

Data

The Alzheimer's Disease Knowledge Assembly has been precompiled with the following command line script, and will be loaded from this format for improved performance. In general, derived data, such as the gpickle representation of a BEL script, are not saved under version control to ensure that the most up-to-date data is always used.

pybel convert --path "$BMS_BASE/aetionomy/alzheimers/alzheimers.bel" --pickle "$BMS_BASE/aetionomy/alzheimers/alzheimers.gpickle"

The BEL script can also be compiled from inside this notebook with the following python code:

>>> import os
>>> import pybel
>>> # Input from BEL script
>>> bel_path = os.path.join(bms_base, 'aetionomy', 'alzheimers', 'alzheimers.bel')
>>> graph = pybel.from_path(bel_path)
>>> # Output to gpickle for fast loading later
>>> pickle_path = os.path.join(bms_base, 'aetionomy', 'alzheimers', 'alzheimers.gpickle')
>>> pybel.to_pickle(graph, pickle_path)

In [9]:
pickle_path = os.path.join(bms_base, 'aetionomy', 'alzheimers', 'alzheimers.gpickle')

In [10]:
graph = pybel.from_pickle(pickle_path)

In [11]:
graph.version


Out[11]:
'3.0.2'

Error Analysis

As stated in the pybel.BELGraph documentation, all warnings encountered during BEL compilation are stored in a list. Each entry contains information about the BEL statement, the line number, the type of error, and the annotations in the parser at the time of the error. PyBEL Tools makes many functions available for systematically analyzing these errors.
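
As a quick sanity check (not part of the original notebook), the raw warning entries can be printed directly. The exact tuple layout varies between PyBEL versions, so nothing beyond simple iteration over the list is assumed here:

>>> for warning in graph.warnings[:3]:
...     print(warning)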

The total number of errors is listed below.


In [12]:
len(graph.warnings)


Out[12]:
3549

The types of errors in a graph and their frequencies can be calculated using pbt.summary.count_error_types.


In [13]:
error_counter = pbt.summary.count_error_types(graph)

In [14]:
barh(error_counter, plt)
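
If, as the variable name suggests, pbt.summary.count_error_types returns a collections.Counter, the most frequent error types can also be listed directly:

>>> error_counter.most_common(5)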


A common type of error is using a name that isn't contained within its namespace. These errors are raised as pybel.parser.parse_exceptions.MissingNamespaceNameWarning. The PyBEL Tools function pbt.summary.calculate_incorrect_name_dict creates a dictionary mapping each namespace to the incorrect names used from it and their frequencies.


In [15]:
incorrect_name_dict = pbt.summary.calculate_incorrect_name_dict(graph)

Using pbt.utils.count_dict_values, the number of unique incorrect names for each namespace is extracted and plotted.


In [16]:
barh(pbt.utils.count_dict_values(incorrect_name_dict), plt)
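
The fuzzywuzzy import at the top of the notebook suggests one possible follow-up: fuzzy string matching can propose the closest valid name for a miscurated entry. The valid_names list below is a small, made-up stand-in for a real namespace's contents, so this is only an illustrative sketch:

>>> from fuzzywuzzy import process
>>> valid_names = ['Neurogenic Inflammation', 'Inflammation', 'Encephalitis']
>>> process.extractOne('Neuroinflammation', valid_names)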


Another common error is writing an identifier without a namespace, or, for short, a naked name. The function pbt.summary.count_naked_names returns a counter of how many times each naked name appeared.


In [17]:
naked_names = pbt.summary.count_naked_names(graph)

The number of unique naked names can be directly calculated with len() on the resulting counter from pbt.summary.count_naked_names.


In [18]:
len(naked_names)


Out[18]:
210

The 25 most common naked names are output below.


In [19]:
naked_names.most_common(25)


Out[19]:
[('amyloid beta peptides', 310),
 ('p', 16),
 ('Microglia', 15),
 ('AICD', 13),
 ('NICD', 12),
 ('YENPTY endocytosis motif (APP)', 12),
 ('Platelets', 11),
 ('2-butenal', 11),
 ('Beta cell', 10),
 ('mitochondrial dysfunction', 9),
 ('BACE1,gmod(M)', 9),
 ('APOE e2', 7),
 ('N-methyl-D-aspartate selective glutamate receptor activity', 7),
 ('AC253', 6),
 ('L1RE1,gmod(M)', 6),
 ('Amyloid beta-peptides', 6),
 ('N-APP', 5),
 ('astrogliosis', 5),
 ('neuroinflammation', 5),
 ('MTHFR,sub(C,677,T)', 5),
 ('MTHFR,sub(T,677,T)', 5),
 ('BDNF,gmod(M)', 5),
 ('17A', 4),
 ('chemokines', 4),
 ('chronic CNS injury', 4)]

Overall, identical errors are grouped together to identify the most frequent ones.


In [20]:
error_groups = pbt.summary.group_errors(graph)

error_group_counts = Counter({k: len(v) for k, v in error_groups.items()})
error_group_counts.most_common(24)


Out[20]:
[('[pos:2] "amyloid beta peptides" should be qualified with a valid namespace',
  153),
 ('"alpha-soluble amyloid precursor protein" is not in the ADO namespace', 63),
 ("Missing citation; can't add: UNSET MeSHAnatomy", 55),
 ("Missing citation; can't add: UNSET Species", 47),
 ('"Neuroinflammation" is not in the MESHD namespace', 42),
 ('"copper sulphate(5.H2O)" is not in the CHEBI namespace', 37),
 ('"cerebrolysin" is not in the PMICHEM namespace', 34),
 ('"nefiracetam" is not in the CHEBI namespace', 34),
 ("Missing citation; can't add: UNSET Disease", 34),
 ('[pos:25] "amyloid beta peptides" should be qualified with a valid namespace',
  31),
 ("Missing citation; can't add: UNSET Subgraph", 29),
 ('Missing citation; can\'t add: SET Species = "10090"', 28),
 ('Missing citation; can\'t add: SET Disease = "Alzheimer Disease"', 27),
 ('"cytokine activity" is not in the GOBP namespace', 26),
 ('"neuroprotection" is not in the PMIBP namespace', 26),
 ('"synaptic transmission" is not in the GOBP namespace', 24),
 ('Missing citation; can\'t add: SET MeSHAnatomy= "Brain"', 23),
 ('"MeSHAnatomy" is not set, so it can\'t be unset', 22),
 ('Missing citation; can\'t add: SET Species = "9606"', 20),
 ('"excitatory synapse" is not in the GOBP namespace', 17),
 ('Abundance GOCC:inflammasome complex should be encoded as one of: Complex',
  17),
 ('"catalysis of free radical formation" is not in the GOBP namespace', 15),
 ('"positive regulation of clathrin-mediated endocytosis" is not in the GOBP namespace',
  15),
 ('"Cell" is not set, so it can\'t be unset', 15)]

Error Analysis by Annotation

It might be useful to group the errors by a certain annotation/value pair. In these examples, the NeuroMMSig Database subgraph annotations are used. Ultimately, the error frequency will be compared to the size of each subgraph.

First, the sizes of the top 30 largest subgraphs are shown below.


In [21]:
size_by_subgraph = pbt.summary.count_annotation_values(graph, 'Subgraph')

In [22]:
plt.figure(figsize=(10, 3))
barv(dict(size_by_subgraph.most_common(30)), plt)
plt.yscale('log')
plt.title('Top 30 Subgraph Sizes')
plt.show()


The list of all errors for each subgraph can be calculated with pbt.summary.calculate_error_by_annotation.


In [23]:
error_by_subgraph = pbt.summary.calculate_error_by_annotation(graph, 'Subgraph')

These data are aggregated with pbt.utils.count_dict_values, which counts the number of items in each list. The top 30 most error-prone subgraphs are shown below.


In [24]:
error_by_subgraph_count = pbt.utils.count_dict_values(error_by_subgraph)

plt.figure(figsize=(10, 3))
barv(dict(error_by_subgraph_count.most_common(30)), plt)
plt.yscale('log')
plt.ylabel('Errors')
plt.title('Top 30 Most Error-Prone Subgraphs')
plt.show()


Finally, the error-to-size ratio is calculated for each subgraph below. The 25 subgraphs with the highest error-to-size ratio are shown.


In [25]:
subgraphs = sorted(size_by_subgraph)
df_data = [(size_by_subgraph[k], error_by_subgraph_count[k], error_by_subgraph_count[k] / size_by_subgraph[k]) for k in subgraphs]

df = pd.DataFrame(df_data, index=subgraphs, columns=['Size', 'Errors', 'E/S Ratio'])

df.to_csv('~/Desktop/errors.tsv', sep='\t')
df.sort_values('E/S Ratio', ascending=False).head(25)


Out[25]:
Size Errors E/S Ratio
Syndecan subgraph 15 38 2.533333
Neurotransmitter release subgraph 6 8 1.333333
Neurotrophic subgraph 56 63 1.125000
Amylin subgraph 20 14 0.700000
Metabolism of steroid hormones subgraph 8 5 0.625000
CRH subgraph 28 17 0.607143
Axonal transport subgraph 33 18 0.545455
Neuroprotection subgraph 49 26 0.530612
Protein biosynthesis subgraph 2 1 0.500000
Glutamatergic subgraph 222 95 0.427928
Prostaglandin subgraph 60 25 0.416667
Autophagy signaling subgraph 24 10 0.416667
Alpha 2 macroglobulin subgraph 23 9 0.391304
Cytokine signaling subgraph 57 21 0.368421
Non-amyloidogenic subgraph 459 161 0.350763
JAK-STAT signaling subgraph 47 16 0.340426
GABA subgraph 154 51 0.331169
Response to oxidative stress 97 27 0.278351
Synaptic vesicle endocytosis subgraph 79 21 0.265823
Blood vessel dilation subgraph 23 6 0.260870
NMDA receptor 138 36 0.260870
Lipid peroxidation subgraph 27 7 0.259259
Response DNA damage 8 2 0.250000
Acetylcholine signaling subgraph 348 85 0.244253
Androgen subgraph 42 10 0.238095

The overall distribution of subgraph sizes and error counts is shown below. It indicates that subgraph size and error count are positively correlated, but there are clear outliers: some large subgraphs were curated carefully, while some smaller ones were curated sloppily.


In [26]:
sns.lmplot('Size', 'Errors', data=df)
plt.title('BEL Errors as a function of Subgraph Size')
plt.show()
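
To put a number on the visual trend (this cell is not part of the original notebook), the Pearson correlation between subgraph size and error count can be computed directly from the data frame:

>>> df['Size'].corr(df['Errors'])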


Conclusions

The BEL language lacks a facility for recording curator provenance. When multiple curators contribute to a single document, it's often difficult to trace where errors originate and to identify the individual responsible for fixing them. When curation work is contracted, error analysis is also crucial for assessing its quality. PyBEL makes error messages programmatically accessible and easy to summarize.

These functions were used to build a web interface that gives feedback to BEL curators who are not comfortable with programming. The code and instructions for deployment are available at https://github.com/cthoyt/pybel-web-validator.